Raoul Grouls
when we write:
\(X = \{x | x \in \mathbb{R}\}\)
this means:
\(X\) is a collection of numbers \(x\), and every \(x\) belongs to the real numbers.
Observations usually have multiple features. E.g. we observe temperature, humidity, wind velocity.
These are all real numbers, and we can describe the state of the weather with these three dimensions.
We can write this down as:
\(x \in \mathbb{R}^3\).
or more generally:
\(x \in \mathbb{R}^d\) for \(d\) dimensions,
or more explicitly:
\(X = \{x_1, \dots, x_m\}\) for \(m\) observations, where each \(x_i \in \mathbb{R}^d\)
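As a minimal sketch in Python (with hypothetical numbers): a single weather observation is a vector in \(\mathbb{R}^3\), and a dataset of \(m\) such observations is an \(m \times d\) matrix:

```python
import numpy as np

# One hypothetical weather observation: temperature, humidity, wind velocity
x = np.array([18.5, 0.72, 5.3])  # x is an element of R^3

# m = 4 observations, each with d = 3 features -> an (m, d) matrix
X = np.stack([x, x + 1.0, x - 1.0, 2.0 * x])
print(X.shape)  # (4, 3)
```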
Functions map collections of numbers to other collections.
For example, say we want to predict a single target value from five features.
This means we hope to find a function:
\(f \colon \mathbb{R}^5 \to \mathbb{R}\).
for example, a linear model:
\(f(X) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5\).
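As a sketch, this linear model is just a dot product between a weight vector and a feature vector; the weights and inputs below are made-up example values:

```python
import numpy as np

# Hypothetical weights and one observation with five features
w = np.array([0.5, -1.0, 2.0, 0.0, 1.5])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

def f(x, w):
    """Linear model: f(x) = w1*x1 + w2*x2 + ... + w5*x5."""
    return np.dot(w, x)

print(f(x, w))  # 0.5 - 2.0 + 6.0 + 0.0 + 7.5 = 12.0
```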
Linear models are one of the simplest models available. Their advantages are:
The basic shape of a linear regression is a line (in \(\mathbb{R}^2\)):
\(Y = W X + b\)
For classification models, we can still use a linear model. Only now we try to separate the points, instead of fitting them to the line.
The model is still: \(Y = W X + b\)
\[Y = \begin{cases} \text{"buy"} & \text{if } y \geq 0\\ \text{"sell"} & \text{if } y < 0\\ \end{cases} \]
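A minimal sketch of this decision rule in Python, thresholding the continuous model output at zero:

```python
def classify(y):
    # Map the continuous model output y = Wx + b to a class label
    return "buy" if y >= 0 else "sell"

print(classify(0.7))   # buy
print(classify(-1.3))  # sell
```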
We can use a linear model for both cases. We call these models “hyperplanes”: they have one dimension less than the ambient space.
With classification, we want the data to be separated by the hyperplane. With regression, we want the points to get as close as possible to the hyperplane.
Which model would you prefer?
We would want to have some sort of safety margin, a maximal “dead zone” with a minimal amount of “violations”.
If our labels are \(y \in \{-1, 1\}\) we would want \(y_i (w x_i + b) \geq 1\) to be true.
But let’s say \(x_1\) is a little bit inside the margin.
And \(x_2\) might be completely on the wrong side of the margin. How to account for these “violations” of our “safe border”?
We assign a “slack value” \(\xi_1\) and \(\xi_2\) to these errors. These values are what is needed to “correct” the errors.
In the end, we will add \(C \sum_i \xi_i\) to the loss function. \(C\) is a value we pick (e.g. 1), and we simply sum up all the “slack” values.
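A sketch of this slack penalty, assuming labels in \(\{-1, 1\}\); the example points and weights below are hypothetical:

```python
import numpy as np

def slack(y, w, x, b):
    """xi_i = max(0, 1 - y_i (w . x_i + b)): zero outside the margin,
    positive for points inside the margin or on the wrong side."""
    return max(0.0, 1.0 - y * (np.dot(w, x) + b))

def slack_penalty(ys, xs, w, b, C=1.0):
    # C * sum_i xi_i, the term added to the loss
    return C * sum(slack(y, w, x, b) for y, x in zip(ys, xs))

# Hypothetical 1-D example with w = 1, b = 0:
# x = 2.0 is safely outside the margin (slack 0),
# x = 0.5 is a little inside the margin (slack 0.5),
# x = -1.0 is completely on the wrong side (slack 2.0)
xs = [np.array([2.0]), np.array([0.5]), np.array([-1.0])]
ys = [1, 1, 1]
print(slack_penalty(ys, xs, np.array([1.0]), 0.0))  # 2.5
```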
We will then prefer the model with the lowest loss.
A lot of data is non-linear. This means we would need a curved hyperplane.
One trick to do this is the kernel-trick, which is commonly used with Support Vector Machines, which we touched upon in the short history.
Deep learning has found another trick, which will be covered in the deep learning course.
Let’s say all you have is a pair of scissors, and you only want to cut once.
Now you have a towel and need to cut off (“classify”) the four corners. How would you do this?
Well, that seems obvious: you bend the towel!
Now, let’s transfer this solution to our mathematical problem:
So how do we “obtain” these extra dimensions? Let’s start with data that has two features: \(X = (x_1, x_2)\)
We could transform this into 3 dimensions by adding the product of the two features:
\(\phi_3(X) = (x_1, x_2, x_1 x_2)\)
or even five dimensions:
\(\phi_5(X) = (x_1, x_2, x_1 x_2, x_1^2, x_2^2)\)
With numbers:
\(X = (2, 3)\)
\(\phi_5(X) = (2, 3, 6, 4, 9)\)
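The feature map \(\phi_5\) can be written out directly:

```python
def phi5(x1, x2):
    # Expand 2 features into 5: the originals, their product, and their squares
    return (x1, x2, x1 * x2, x1 ** 2, x2 ** 2)

print(phi5(2, 3))  # (2, 3, 6, 4, 9)
```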
A simplified definition of a kernel is:
a symmetric function that
- takes two inputs
- returns a measure of how similar they are: maximal when the inputs are identical, and smaller the further apart they are
An often used kernel is the Gaussian kernel:
\(K(x, x') = \exp(-\gamma \|x - x'\|^2)\)
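A direct translation of the Gaussian kernel into Python (the \(\gamma\) value below is just an example):

```python
import numpy as np

def gaussian_kernel(x, x_prime, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    diff = np.asarray(x) - np.asarray(x_prime)
    return np.exp(-gamma * np.dot(diff, diff))

print(gaussian_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0 (identical inputs)
print(gaussian_kernel([0.0, 0.0], [1.0, 0.0]))  # exp(-1) ~ 0.368
```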
There are a lot of kernels:
Using basis expansion might seem like a nice trick. But it can be much too powerful…
With too many variables, your model will become too complex very quickly.
The image on the left is a polynomial basis expansion up to \(x^{100}\). As you can see, the model is way too complex.
Support Vector Machines will often protect you against overfitting, because of their preference for models with a large “safety margin”.
However, with an SVM we will need to pick optimal hyperparameters for the model, such as \(C\) and the kernel parameter \(\gamma\).
This means we will need to do some hyperparameter tuning.
While picking the right model with the right hyperparameters can protect you against overfitting, an important measure against overfitting is a train-test split. If the model only “remembers” the training set, it will fail on the test and validation sets.
Typically, we use three sets:
A general ratio would be 80-10-10 for train-test-validation.
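A sketch of such a split on a hypothetical dataset of 100 observations (shuffling before slicing, so the sets are random):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))  # 100 hypothetical observations, 3 features

# Shuffle the indices, then slice 80-10-10
idx = rng.permutation(len(X))
n_train, n_test = int(0.8 * len(X)), int(0.1 * len(X))
train = X[idx[:n_train]]
test = X[idx[n_train:n_train + n_test]]
val = X[idx[n_train + n_test:]]

print(len(train), len(test), len(val))  # 80 10 10
```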